[fix](ggml-cuda): ensure min 1 block per SM #16633

catan2001 · 2025-10-17T09:41:12Z

Description

Testing llama.cpp with Llama 3.2 model on RX 6700XT caused a floating point exception (SIGFPE) when launching the FlashAttention kernel (fattn_kernel):

Thread 1 "llama-cli" received signal SIGFPE, Arithmetic exception. 
0x000079c19ca40223 in void launch_fattn<64, 16, 1>(ggml_backend_cuda_context&, ggml_tensor*, void (*)(char const*, char const*, char const*, char const*, char const*, int const*, float*, HIP_vector_type<float, 2u>*, float, float, float, float, unsigned int, float, int, int, int, int, int, int, int, int, int, int, int, int, int, long, int, int, long, int, int, int, int, int, long), int, unsigned long, int, bool, bool, bool, int) () 
from /therock/src/external-builds/llama.cpp/llama.cpp/build/bin/libggml-hip.so

Technical Explanation

The issue occurs because cudaOccupancyMaxActiveBlocksPerMultiprocessor sometimes returns 0 for max_blocks_per_sm due to high shared memory or register usage. This value is later used in a division:

const int max_blocks = max_blocks_per_sm * nsm;
const int tiles_nwaves = (ntiles_total + max_blocks - 1) / max_blocks;

When max_blocks_per_sm is 0, this causes a division by zero, triggering the FPE.

Fix:

Add a safeguard to ensure at least one block is launched:

max_blocks_per_sm = std::max(max_blocks_per_sm, 1);

Some kernel configurations can produce zero occupancy on certain GPUs (example: RX 6700XT). This adds a safeguard to ensure at least one block is launched, preventing floating point exception. Co-authored-by: Attila Dusnoki <[email protected]>.

JohannesGaessler · 2025-10-17T09:55:33Z

This seems like the wrong fix. Under which circumstances is an occupancy of 0 returned?

catan2001 · 2025-10-17T10:06:24Z

This seems like the wrong fix. Under which circumstances is an occupancy of 0 returned?

@JohannesGaessler Hi, sorry if my initial description wasn’t clear enough. So this happens when I run llama-cli with the Llama-3.2 3B. Specifically, the error seems to be caused by max_blocks_per_sm being set to 0. This was done on AMD RX 6700XT.

Here is small print log:

FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 0    < causes the fpe

JohannesGaessler · 2025-10-17T13:10:21Z

I don't understand why this is happening. According to the documentation GFX 1030 (which I use for testing) and GFX 1031 have the same amount of SRAM and registers so I would expect them to be able to achieve the same occupancy. I am not seeing any warnings about failing to meet occupancy targets in the compilation log, both for GFX 1030 and for GFX 1031. Please provide me with the exact commands you used to compile and run llama.cpp. It would also be helpful if you could provide me with the values for nwarps and nbytes_shared for the failing case.

catan2001 · 2025-10-17T16:15:08Z

I don't understand why this is happening. According to the documentation GFX 1030 (which I use for testing) and GFX 1031 have the same amount of SRAM and registers so I would expect them to be able to achieve the same occupancy. I am not seeing any warnings about failing to meet occupancy targets in the compilation log, both for GFX 1030 and for GFX 1031. Please provide me with the exact commands you used to compile and run llama.cpp. It would also be helpful if you could provide me with the values for nwarps and nbytes_shared for the failing case.

Sorry for the delay. I'll go ahead and close the PR, as I've identified the issue: it was related to using TheRock as the build environment for ROCm. After testing with prebuilt ROCm versions 6.4.1 and 7.0.2, everything works correctly without triggering the floating point exception.

catan2001 requested a review from JohannesGaessler as a code owner October 17, 2025 09:41

github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Oct 17, 2025

catan2001 closed this Oct 17, 2025

JohannesGaessler mentioned this pull request Oct 17, 2025

CUDA: better error for FA kernel with 0 occupancy #16643

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[fix](ggml-cuda): ensure min 1 block per SM #16633

[fix](ggml-cuda): ensure min 1 block per SM #16633

Uh oh!

catan2001 commented Oct 17, 2025

Uh oh!

JohannesGaessler commented Oct 17, 2025

Uh oh!

catan2001 commented Oct 17, 2025 •

edited

Loading

Uh oh!

JohannesGaessler commented Oct 17, 2025

Uh oh!

catan2001 commented Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[fix](ggml-cuda): ensure min 1 block per SM #16633

[fix](ggml-cuda): ensure min 1 block per SM #16633

Uh oh!

Conversation

catan2001 commented Oct 17, 2025

Description

Technical Explanation

Fix:

Uh oh!

JohannesGaessler commented Oct 17, 2025

Uh oh!

catan2001 commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Oct 17, 2025

Uh oh!

catan2001 commented Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

catan2001 commented Oct 17, 2025 •

edited

Loading